Labeling training data is increasingly the largest bottleneck in deploying machine learning systems. We present Snorkel, a first-of-its-kind system that enables users to train state-of-the-art models without hand labeling any training data. Instead, users write labeling functions that express arbitrary heuristics, which can have unknown accuracies and correlations. Snorkel denoises their outputs without access to ground truth by incorporating the first end-to-end implementation of our recently proposed machine learning paradigm, data programming. We present a flexible interface layer for writing labeling functions based on our experience over the past year collaborating with companies, agencies, and research labs. In a user study, subject matter experts build models 2.8x faster and increase predictive performance an average 45.5% versus seven hours of hand labeling. We study the modeling tradeoffs in this new setting and propose an optimizer for automating tradeoff decisions that gives up to 1.8x speedup per pipeline execution. In two collaborations, with the U.S. Department of Veterans Affairs and the U.S. Food and Drug Administration, and on four open-source text and image data sets representative of other deployments, Snorkel provides 132% average improvements to predictive performance over prior heuristic approaches and comes within an average 3.60% of the predictive performance of large hand-curated training sets.
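To make the labeling-function idea concrete, here is a minimal sketch in plain Python. The function names, example heuristics, and the majority-vote combiner are our own illustrative assumptions, not Snorkel's API: Snorkel instead learns the unknown accuracies and correlations of the labeling functions with a generative model before producing probabilistic training labels.

```python
# Illustrative sketch only: each labeling function votes on an example or
# abstains; a naive majority vote stands in for Snorkel's learned
# denoising model (all names below are hypothetical).

ABSTAIN, NEGATIVE, POSITIVE = -1, 0, 1

def lf_keyword(text):
    # Heuristic: a causal keyword suggests a positive label.
    return POSITIVE if "cause" in text.lower() else ABSTAIN

def lf_negation(text):
    # Heuristic: an explicit negation phrase suggests a negative label.
    return NEGATIVE if "not associated" in text.lower() else ABSTAIN

def combine(lfs, text):
    """Combine noisy labeling-function votes by simple majority."""
    votes = [lf(text) for lf in lfs]
    votes = [v for v in votes if v != ABSTAIN]
    if not votes:
        return ABSTAIN
    return max(set(votes), key=votes.count)

lfs = [lf_keyword, lf_negation]
print(combine(lfs, "Smoking causes cancer."))  # 1 (POSITIVE)
```

Because each heuristic may be wrong or may conflict with others, replacing the majority vote with a model that estimates per-function accuracy is exactly the denoising step the abstract describes.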